Getting Started

This document provides a very brief overview to basic progamming in R adapted from training provided by Prof. Ross Ihaka.

Further information can be found at https://www.stat.auckland.ac.nz/~ihaka/?Teaching

Topics include:

  • Setting Up
  • R 101: The Basics
  • Creating Plots
  • Writing Functions
  • R and Hilltop
  • Report Writing

Set up

To start we need to download R, as of Oct 2018 the latest version is 3.5.1 (Feather Spray). I also highly recommend downloading RStudio - an integrated development environment (IDE) for R.

Starting to Code

Basic calculations

One way to think of R is as a very powerful calculator. If we type a calculation into the console an answer is printed out.

# Performing calculations using mathematical notation
2 + 2
## [1] 4
# Precedence can be set using parentheses
(10 + 5) / 2
## [1] 7.5

Once R has completed the computation the answer is discarded. To ‘save’ a result you can assign a name using <- or = to create a variable. Variables can then be used in subsequent expressions.

# Create a variable by assigning a name 'a'
a <- 5^2

Notice no result is printed

# Print the result named 'a'
a
## [1] 25
# Calculate the square root of 'a'
sqrt(a)
## [1] 5

Vectors

R operates on named data structures - the most basic of which is a vector. Vectors contain an indexed collection of values of the same type.

Vectors are usually created with the function c()

x <- c(1, 3, 5, 7, 9)
y <- c(TRUE, FALSE, TRUE)

Check the type of a variable

class(x)
## [1] "numeric"
class(y)
## [1] "logical"

Determine the number of elements in a vector

length(x)
## [1] 5
length(y)
## [1] 3
# Be careful of mixing types
c(1, 5.5, 7)
## [1] 1.0 5.5 7.0
c(1, TRUE, "Hello")
## [1] "1"     "TRUE"  "Hello"

Number sequences

Creating vectors by a sequence

5:10
## [1]  5  6  7  8  9 10
seq(1, 10, by = 2)
## [1] 1 3 5 7 9
-12.5:12
##  [1] -12.5 -11.5 -10.5  -9.5  -8.5  -7.5  -6.5  -5.5  -4.5  -3.5  -2.5
## [12]  -1.5  -0.5   0.5   1.5   2.5   3.5   4.5   5.5   6.5   7.5   8.5
## [23]   9.5  10.5  11.5

Vector arithmetic

x
## [1] 1 3 5 7 9
x + 10
## [1] 11 13 15 17 19
sqrt(x)
## [1] 1.000000 1.732051 2.236068 2.645751 3.000000
1:10/2
##  [1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Subsetting vectors

Values can be subset out a vector using square brackets [ ]

x
## [1] 1 3 5 7 9
x[3]
## [1] 5
x[2:4]
## [1] 3 5 7
x[-3]
## [1] 1 3 7 9

Conditional subsetting

x > 5
## [1] FALSE FALSE FALSE  TRUE  TRUE
x[x > 5]
## [1] 7 9

Named vectors

names(x) <- c("Jan", "Mar", "May", "Jul", "Sep")
x
## Jan Mar May Jul Sep 
##   1   3   5   7   9
x["May"]
## May 
##   5

Summarizing data

c(mean = mean(x), sd = sd(x))
##     mean       sd 
## 5.000000 3.162278
quantile(x)
##   0%  25%  50%  75% 100% 
##    1    3    5    7    9
cumsum(x)
## Jan Mar May Jul Sep 
##   1   4   9  16  25

Tip: To get information about a function try ?function name

Dataframes

Dataframes are one of the ways R presents tabular data.

A dataframe is comprised of columns and rows, both of which can be named, of the same length. Each column holds the same type of data.

rain <- read.csv("Data/MIL_DailyRainfall.csv")
rain$Date <- as.Date(rain$Date)

# Display the start or end of a dataframe with head() and tail()
head(rain)

Subsetting dataframes

Dataframes can be subset using [row, column]

rain[1, 1]
## [1] "2017-01-01"
rain[1,]
head(rain[,1])
## [1] "2017-01-01" "2017-01-02" "2017-01-03" "2017-01-04" "2017-01-05"
## [6] "2017-01-06"

Building a PCAL dataframe

# Create a month column for calculating statistics
rain$Month <- factor(format(rain$Date, "%b"), levels = month.abb)

head(rain)

Note that you can call a column by name with $

# # Calculate monthly total rainfall
# rainTotal <- tapply(rain$Rainfall, rain$Month, sum)
# 
# # Create a dataframe with month and total rainfall
# df <- data.frame(Month = month.abb,
#                  Total = rainTotal)
# 
# head(df)

Add min, mean and max to the dataframe

# Add min, mean and max to the dataframe
pcal <- cbind(df, 
              Min = tapply(rain$Rainfall, rain$Month, min), 
              Mean = tapply(rain$Rainfall, rain$Month, mean), 
              Max = tapply(rain$Rainfall, rain$Month, max))

# Drop the rownames
row.names(pcal) <- 1:12


head(pcal)
##   df Min Mean     Max 
## 1 ?  0   2.854839 36.5
## 2 ?  0   2.75     29.5
## 3 ?  0   3.096774 18.9
## 4 ?  0   6.416667 45.5
## 5 ?  0   3.354839 21.5
## 6 ?  0   1.083333 13

Rounding numbers to make the table more user friendly

# Add min, mean and max to the dataframe
pcal <- cbind(df, 
              Min = tapply(rain$Rainfall, rain$Month, min), 
              ### <b>
              Mean = round(tapply(rain$Rainfall, rain$Month, mean), 2), 
              ### </b>
              Max = tapply(rain$Rainfall, rain$Month, max))

# Drop the rownames
row.names(pcal) <- 1:12


head(pcal)
##   df Min Mean Max 
## 1 ?  0   2.85 36.5
## 2 ?  0   2.75 29.5
## 3 ?  0   3.1  18.9
## 4 ?  0   6.42 45.5
## 5 ?  0   3.35 21.5
## 6 ?  0   1.08 13

Making plots

Base R plotting

R has built in plotting capabilities - to view more details about possible graphical parameters try ?plot

x <- data.frame(Month = 1:12, 
                Flow = c(25.4, 38.6, 27.3, 56.5, 53.4, 48.7, 
                         59.1, 62.1, 60.1, 69.8, 25.3, 13.5))

A basic plot

plot(x)

Improving the plot

plot(x, main = "Manawatu at Teachers College", pch = 16, col = 1:12, type = "b",
     xlab = "2017", ylab = "Avg Monthly Flow (cumecs)")

Adding more information to our plot

# Choosing colours for the points based on flow values
cols <- vector(length = 12)
cols[x$Flow < 30] <- "red"
cols[x$Flow > 60] <- "blue"
cols[x$Flow >= 30 & x$Flow <= 60] <- "black"

—-_

plot(x, main = "Manawatu at Teachers College Monthly Flow", xaxt = "n",
     xlab = "2017", ylab = "Avg Monthly Flow (cumecs)")
rect(6, 0, 9, 72, col = "lightgrey")
text(7.5, 45, "Winter")
lines(x$Month, x$Flow)
points(x, pch = 16, col = cols)
abline(h = 30, lty = 2)
abline(h = 60, lty = 2)
text(c(0, 20), "Low flows")
text(c(0, 65), "High flows")
axis(1, x$Month, month.abb)

Add a legend

par(oma=c(0, 0, 0, 5))

plot(x, main = "Manawatu at Teachers College Monthly Flow", xaxt = "n",
     xlab = "2017", ylab = "Avg Monthly Flow (cumecs)")
rect(6, 0, 9, 72, col = "lightgrey")
text(7.5, 45, "Winter")
lines(x$Month, x$Flow)
points(x, pch = 16, col = cols)
abline(h = 30, lty = 2)
abline(h = 60, lty = 2)
text(c(0, 20), "Low flows")
text(c(0, 65), "High flows")
axis(1, x$Month, month.abb)

legend(x = par('usr')[2], y = par('usr')[4], bty='n', xpd=NA, pch = 16, col = unique(cols),
       legend = c("Low flow", "Moderate flow", "High flow"))


Other libraries for plotting

There are a multitude of different packages for plotting. Some of the most popular include:

  • ggplot2
  • htmlwidgets


ggplot2

ggplot2 is part of the tidyverse package collection. Very basically it works by mapping the aesthetics of a dataset and then adding layers, scales, facets, etc… to build up a plot. If you would like to learn more about tidyverse coding I highly recommend R for data science by Hadley Wickham.

library(ggplot2)

ggplot(data = mpg) + 
geom_point(mapping = aes(x = displ, y = hwy, color = class))

htmlwidgets are a collection of packages which provide an R interface to JavaScript libraries to create interactive visualisations. There are over 50 widgets available including dedicated options for mapping and timeseries. To see what widgets exist visit htmlwidgets for R


htmlwidgets

library(plotly)
plot_ly(data = iris, x = ~Sepal.Length, y = ~Petal.Length, color = ~Species, type = "scatter")

library(dygraphs)
dygraph(nhtemp, main = "New Haven Temperatures") %>% 
  dyRangeSelector(dateWindow = c("1920-01-01", "1960-01-01"))

library(leaflet)
leaflet() %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addMarkers(lng=174.768, lat=-36.852, popup="The birthplace of R")

Functions in R

Built in functions

Functions in R are denoted as: function name( )

Function parameters or arguments are separated by a comma and may be named.

We have used a number of functions already such as;

x <- c(1:5)

length(x)
## [1] 5
plot(x, main = "A Plot", col = x, pch = 16)

Tip: To view a function’s arguments try typing in function name() and hit tab while the cursor is inside the parentheses.

Vectorization

Vectorization is a fundamental principle in R. It refers to the ability to perform operations on a vector as a whole rather than element by element.

For example we can perform vector arithmetic;

x <- c(1, 2, 3)
y <- c(10, 20, 30)
x+y
## [1] 11 22 33

Or we can pass a vector to a function and receive a result for each element;

sqrt(x)
## [1] 1.000000 1.414214 1.732051

Recycling

The recycling rule is how R handles vector operations on vectors of different length. When the operation requires that both vectors be the same length R will automatically recycle the values of the shorter vector until both vectors are the same length.

For example;

x + 10
## [1] 11 12 13
x[c(TRUE, FALSE)]
## [1] 1 3

A Calculator Curiosity

This example is taken from the book Professor Stewart’s Hoard of Mathematical Treasures by Ian Stewart. The solution code was provided by Prof. Ross Ihaka.

(8 x 8) + 13

(8 x 88) + 13

(8 x 888) + 13

(8 x 8888) + 13

(8 x 88888) + 13

(8 x 888888) + 13

(8 x 8888888) + 13

(8 x 88888888) + 13

A Calculator Curiosity: A vectorised solution

The sequences of 8 can be decomposed into 8, 8 + 80, 8 + 80 + 800, ect…

Those values are the cummulative sum of the sequence 8, 80, 800, etc…

That sequence is equivalent to 8x10^0, 8x10^1, 8x10^2, etc…

So our code is as follows;

n = 8
cumsum(8 * 10^(1:n - 1))
## [1]        8       88      888     8888    88888   888888  8888888 88888888
8 * cumsum(8 * 10^(1:n - 1)) + 13
## [1]        77       717      7117     71117    711117   7111117  71111117
## [8] 711111117

Control flow

There are several control constructs available in R, most commonly;

if(condition){
  expression
}
if(condition){
  expression_1
}else{
  expression_2
}
for(variable in vector){
  expression
}

A Calculator Curiosity: A control flow solution

How can we solve the calculator curiosity problem using control flow?

The most logical method for solving our problem is using a for loop.

Again building up the 8, 88, 888… is key.

We can do this by multiplying the previous value by 10 and adding 8.

A Calculator Curiosity: A control flow solution

numSeq = 8

for (i in 1:8) {
  print(8 * numSeq + 13)
  
  numSeq <- 10 * numSeq + 8
}
## [1] 77
## [1] 717
## [1] 7117
## [1] 71117
## [1] 711117
## [1] 7111117
## [1] 71111117
## [1] 711111117

Writing your own functions

At some point you may want to start writing your own functions.

This can easily be acomplished in R using the structure

functionName <- function(arg_1, arg_2 = defaultValue){

                         Expressions to do something

                         return(results)
                }

You can then call your function with the default value

functionName(x)

or if you would like to overwrite the default value

functionName(x, arg_2 = y)

A Calculator Curiosity: A function

Now that we can calculate the problem, how can we make a function to output any number of answers?

Our vectorised version is already a simple expression using 1 input:

curiosity_1 <- function(n) 8 * cumsum(8 * 10^(1:n - 1)) + 13
curiosity_1(5)
## [1]     77    717   7117  71117 711117

A Calculator Curiosity: A function

Our control flow solution is a bit more complicated. We need some way to store the results instead of just printing them out.

curiosity_2 <- function(n){
  ans <- numeric(n)
  numSeq = 8

  for (i in 1:n) {
    ans[i] <- 8 * numSeq + 13
    
    numSeq <- 10 * numSeq + 8
  }
  return(ans)
}
curiosity_2(3)
## [1]   77  717 7117